{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "703125c0-7f27-4ef6-ada2-b433879b0c97",
   "metadata": {},
   "source": [
    "# Synthetic Data Generation\n",
    "## Description\n",
    "\n",
    "This notebook demonstrates how to use the nemotron-4-340b-instruct model for synthetic data generation that is used in this blueprint.\n",
    "\n",
    "This uses the nvidia gear store data as a source of product data.\n",
    "\n",
    "It then creates a sample customer set and then creates a realistic order history based on the nvidia gear store data.\n",
    "\n",
    "## Usage Instructions\n",
    "1. Install the required libraries using pip.\n",
    "2. Run each cell sequentially to execute the notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7edd1357-a874-4348-a34d-614fa344a2bf",
   "metadata": {},
   "source": [
    "### Install requirements"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f93e604a-04e6-4800-8a80-4b68c1491575",
   "metadata": {},
   "outputs": [],
   "source": [
    "## Install requirements\n",
    "!pip install pandas\n",
    "!pip install --upgrade --quiet langchain-nvidia-ai-endpoints\n",
    "!pip install langchain"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5b899a94-f4d5-414d-9b5d-3422c85d5a5f",
   "metadata": {},
   "source": [
    "### Set the NVIDIA API Key \n",
    "\n",
    "The nemotron-4-340b-instruct model is accessed from the NVIDIA API Catalog. In order to access it, you need to set a valid API Key."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d6db9484-2b47-45c8-92b3-235ebb2ed1d6",
   "metadata": {},
   "outputs": [],
   "source": [
    "import getpass\n",
    "import os\n",
    "if not os.environ.get(\"NVIDIA_API_KEY\", \"\").startswith(\"nvapi-\"):\n",
    "    nvidia_api_key = getpass.getpass(\"Enter your NVIDIA API key: \")\n",
    "    assert nvidia_api_key.startswith(\"nvapi-\"), f\"{nvidia_api_key[:5]}... is not a valid key\"\n",
    "    os.environ[\"NVIDIA_API_KEY\"] = nvidia_api_key"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cedc2be6-4a0f-4851-9bf5-14a100372203",
   "metadata": {},
   "source": [
    "## Re-usable functions\n",
    "\n",
    "This function is used to generated the data. Data returned is in the form of a json object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "36854645-32f9-438b-b499-5c79461694e6",
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_do_task(llm, template, input_variables=[], input_values={}):\n",
    "    from langchain_core.output_parsers import JsonOutputParser    \n",
    "    parser = JsonOutputParser()\n",
    "    task_template = PromptTemplate(\n",
    "        input_variables=input_variables,\n",
    "        template=template\n",
    "    )\n",
    "    synthetic_data_chain = task_template | llm | parser \n",
    "    response = synthetic_data_chain.invoke(input_values)\n",
    "    return response"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "05310e58-a60b-495d-b803-59ceefee1a83",
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "\n",
    "unique_numbers = set()\n",
    "def get_random_number(low, high):\n",
    "    # Loop until we have a certain number of unique random numbers\n",
    "    while True:\n",
    "        random_number = random.randint(low, high)\n",
    "        if random_number not in unique_numbers:\n",
    "            unique_numbers.add(random_number)\n",
    "            break\n",
    "    return random_number"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c02965cf-60d0-4f22-aed3-79d593b1598a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import random\n",
    "import getpass\n",
    "import os\n",
    "import langchain\n",
    "from langchain.prompts import PromptTemplate\n",
    "\n",
    "from langchain_nvidia_ai_endpoints import ChatNVIDIA\n",
    "llm = ChatNVIDIA(model=\"nvidia/nemotron-4-340b-instruct\", temperature=1.0, max_tokens=2048)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a1e7e41c-0f55-49b6-a65e-e928cd7c4e14",
   "metadata": {},
   "source": [
    "## Variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6504ee7f-8408-475c-aad2-f264efe8a1bf",
   "metadata": {},
   "outputs": [],
   "source": [
    "CUSTOMER_FILENAME=\"./customers.csv\"\n",
    "PRODUCT_FILENAME=\"../data/gear-store.csv\"\n",
    "ORDERHISTORY_FILENAME=\"./orders.csv\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13b529b2-cee0-4339-a09d-a97b50639a6c",
   "metadata": {},
   "source": [
    "## Customer Profile Generation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f5e779ba-a00f-4c48-aa57-610a84c2b2a0",
   "metadata": {},
   "outputs": [],
   "source": [
    "customer_template = '''\n",
    "Create 10 rows of data representing a customer in a database.\n",
    "The customer table has the following schema. \n",
    "\n",
    "  'FNAME': This is the customer's first name, \n",
    "  'LNAME': This is the customer's last name,\n",
    "  'AGE': age,\n",
    "  'GENDER': gender. this should me male or female,\n",
    "  'STATE': state in united states\n",
    "  'ZIPCODE': zipcode and this should be present in the state mentioned above,\n",
    "  'INTERESTS': this is any 2-3 interests \n",
    "  'MEMBER_SINCE': This is a date between 1st Jaunary 2015 and 30th June 2024\n",
    "\n",
    "\n",
    "Ensure the first name and last names are unique pairs in the dataset created.\n",
    "Return the data in the form of json array of customer rows.\n",
    "Do not include any niceties. \n",
    "Do you not add any more attributes.\n",
    "Return the data in the form of a json array that be loaded into a python variable with json.loads(..). Do not have the \"json\" work in the returned string\n",
    "'''\n",
    "customers = get_do_task(llm, customer_template,input_variables=[], input_values={})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8b0f362f-90c5-41e6-ba1e-8cd27cc34467",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import csv\n",
    "# Specify the CSV file name\n",
    "csv_file = CUSTOMER_FILENAME\n",
    "# Initialize an empty set to store unique random numbers\n",
    "unique_numbers = set()\n",
    "\n",
    "for customer in customers:\n",
    "    for key, value in customer.items():\n",
    "        if isinstance(value, str):\n",
    "            customer[key] = value.strip()\n",
    "        elif isinstance(value, list):\n",
    "            customer[key] = ', '.join(value).strip()\n",
    "    customer['CID'] = get_random_number(100,10000)\n",
    "\n",
    "fieldnames = ['CID'] + [key for key in customers[0].keys() if key != 'CID']\n",
    "        \n",
    "# Write the data to a CSV file\n",
    "with open(csv_file, mode='w', newline='') as file:\n",
    "    writer = csv.DictWriter(file, fieldnames=fieldnames)\n",
    "    writer.writeheader()  # Write the header row\n",
    "    for customer in customers:\n",
    "        writer.writerow(customer)\n",
    "\n",
    "print(f\"Data successfully written to {csv_file}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cfa1997b-7238-46fe-a00d-2fe1440a1315",
   "metadata": {},
   "source": [
    "## Order History Generation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ced66b69-c5d3-4d5e-9c27-378544484a00",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create the data frames \n",
    "# 1. Electronics \n",
    "# 2. Other stuff\n",
    "# 3. Ensure you don't use the gift card \n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "# Load the CSV file\n",
    "file_path = PRODUCT_FILENAME\n",
    "df = pd.read_csv(file_path)\n",
    "\n",
    "# Filter out any products that contain the word \"Gift card\" in the name column\n",
    "df_filtered = df[~df['name'].str.contains('Gift card', case=False, na=False)]\n",
    "\n",
    "# First DataFrame: NVIDIA Electronics category (excluding gift cards)\n",
    "df_nvidia_electronics = df_filtered[df_filtered['category'] == 'NVIDIA Electronics'][['name', 'description', 'price']]\n",
    "\n",
    "# Second DataFrame: All other categories (excluding gift cards)\n",
    "df_other = df_filtered[df_filtered['category'] != 'NVIDIA Electronics'][['name', 'description', 'price']]\n",
    "\n",
    "# Display both DataFrames\n",
    "print(\"NVIDIA Electronics DataFrame:\")\n",
    "print(df_nvidia_electronics)\n",
    "\n",
    "print(\"\\nOther Categories DataFrame:\")\n",
    "print(df_other)\n",
    "\n",
    "NUM_ELECTRONIC_PRODUCTS = len(df_nvidia_electronics)\n",
    "NUM_OTHER_PRODUCTS = len(df_other)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2c6e4a4f-6986-4a01-bf80-307289a4c7ce",
   "metadata": {},
   "outputs": [],
   "source": [
    "order_template = '''\n",
    "    You will be given information regarding 10 products in an array where each array element contains the \n",
    "    product name, product description and product price. You will also be given the current date. Use these things to do the following:\n",
    "    \n",
    "    You will need to create an order history table containing 10 rows of data representing\n",
    "    where each row maps to one of the 10 products you were given.\n",
    "    Make the data interesting i.e. don't usually stick to delivered but do returns and return rejects \n",
    "    Return the data in the form of json object\n",
    "    Do not include any niceties. \n",
    "    Do you not add any more attributes.\n",
    "    Return the data in the form of a json array that be loaded into a python variable with json.loads(..).\n",
    "    Do not have the word \"json\" in the returned string\n",
    "    \n",
    "    You will be given the product description to use in the creation of the ReturnReason. \n",
    "    \n",
    "    Product Description: {product_desc}\n",
    "    Current Date: {current_date}\n",
    "    \n",
    "    This schema of the data is presented like this:- \n",
    "    <attribute> <type> <description>\n",
    "    product_name STRING \"Product name\"\n",
    "    product_description STRING \"Product description\"\n",
    "    OrderDate DATETIME \"Date and time when the order was placed\"\n",
    "    Quantity INT \"Number of units ordered\" \n",
    "    OrderAmount INT \"Product Price X Quantity\"\n",
    "    OrderStatus\tVARCHAR(50)\t\"Status of the order\" \n",
    "    ReturnStatus\tVARCHAR(50)\t\"Status of the return\" \n",
    "    ReturnStartDate\tDATETIME \"Date when the return was started\" \n",
    "    ReturnReceivedDate DATETIME\t\"Date when the return was receive\" \n",
    "    ReturnCompletedDate\tDATETIME \"Date when the return was completed\" \n",
    "    ReturnReason VARCHAR(255) \"Reason for the return\" \n",
    "    Notes VARCHAR(255) \"Notes sent to the customer\" \n",
    "    \n",
    "    Instructions to set the various attributes:\n",
    "    * Keep the current date in mind when generating the data.\n",
    "    * product_name: This should match the product name from the input that this record is generated for.\n",
    "    * product_description: This should match the product description input that this record is generated for.\n",
    "    * OrderDate: This should be a date between October 1st 2024 and October 20th 2024. \n",
    "    * Quantity: This should be a number between 1 and 8\n",
    "    * OrderAmount: This is a multiple of the Quantity and the Product price in the input. \n",
    "    * OrderStatus: This should be one these values [Pending, Processing, Shipped, In Transit, Out for Delivery, Delivered, Cancelled, Returned, Return Requested, Delayed, On Hold].\n",
    "    * ReturnStatus: This should be set to one of these values [None, Requested, Approved, Rejected, Received, Processing, Refunded, Pending Approval, Return to Sender, Awaiting Customer Action]\n",
    "      This should be set ONLY when OrderStatus is set to either \"Returned\" or  \"Return Requested\" else set to None\n",
    "    \n",
    "    * ReturnStartDate: This is set only when the OrderStatus is \"Return Requested\". It should be a date should be within 7 days after the OrderDate but before the \"current date\" else set to None.\n",
    "    * ReturnReceivedDate: This should be set to a date within 5 days after the ReturnStartDate else set to None\n",
    "    * ReturnCompletedDate: This should be a date when the return was completed and set only when the OrderStatus field is set to \"Returned\" and ReturnStatus field set to \"Approved\".\n",
    "                           This should be a minimum of 15 and 30 days after the ReturnReceivedDate else set to None\n",
    "    * ReturnReason: This should be set to a creative reason that would make logical sense based on the product description only when the OrderStatus is set to \"Returned\" or \"Return Requested\"\n",
    "      This should be set to something when the ReturnStatus is not None. \n",
    "    * Notes: This is information sent back to the customer. This should be something that is relevant and makes sense for the \n",
    "             product when the ReturnStatus is set to \"Rejected\". It can also be used to put notes if the order has been in processing state for longer than usual.\n",
    "             \n",
    "    \n",
    "    In the returned string:-\n",
    "    Do not include any niceties. \n",
    "    Do you not add any more attributes.\n",
    "    Return the data in the form of a json array that be loaded into a python variable with json.loads(..). Do not have the word \"json\" in the returned string\n",
    "    '''"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7c45ce1b-cdaa-4d53-a224-75e4cdd4fd05",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a table of 10 products with 7 products being technical and 3 products being \"other\"\n",
    "#NUM_ELECTRONIC_PRODUCTS = len(df_nvidia_electronics)\n",
    "#NUM_OTHER_PRODUCTS = len(df_other)\n",
    "def get_random_products():\n",
    "    import pandas as pd\n",
    "    # Randomly pick 7 rows from the DataFrame and select only the name and description columns\n",
    "    random_rows = df_nvidia_electronics.sample(n=7)[['name', 'description', 'price']]\n",
    "    # Convert the selected DataFrame rows to a regular array (list of lists)\n",
    "    random_rows_json = random_rows.to_dict(orient='records')\n",
    "    # Display the array    \n",
    "    product_rows = random_rows_json    \n",
    "    random_rows = df_other.sample(n=3)[['name', 'description', 'price']]\n",
    "    # Convert the selected DataFrame rows to a regular array (list of lists)\n",
    "    random_rows_json = random_rows.to_dict(orient='records')\n",
    "    # Display the array\n",
    "    product_rows.append(random_rows_json)\n",
    "    return product_rows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1110edbe-3b5d-4309-8951-99313a708a72",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "# Delete the order history file if it exists\n",
    "import os\n",
    "file_path = ORDERHISTORY_FILENAME\n",
    "\n",
    "# Check if the file exists and delete it\n",
    "if os.path.exists(file_path):\n",
    "    os.remove(file_path)\n",
    "    print(f\"File '{file_path}' has been deleted.\")\n",
    "else:\n",
    "    print(f\"File '{file_path}' does not exist.\")\n",
    "\n",
    "# Now read the customer data file one row at a time\n",
    "import pandas as pd\n",
    "\n",
    "# Load the CSV file\n",
    "file_path = CUSTOMER_FILENAME\n",
    "dfcustomer = pd.read_csv(file_path)\n",
    "\n",
    "# Initialize an empty set to store unique random numbers\n",
    "unique_numbers = set()\n",
    "\n",
    "# Loop over each row in the DataFrame\n",
    "for index, row in dfcustomer.iterrows():\n",
    "    random_prod_list = get_random_products()\n",
    "    print(row['CID'])\n",
    "\n",
    "    order_history = get_do_task(llm, order_template, input_variables = ['product_desc', 'current_date'], \n",
    "                                    input_values = {'product_desc': random_prod_list, 'current_date': \"23rd October 2024\"})\n",
    "    df = pd.DataFrame(order_history)\n",
    "\n",
    "    # Add the Customer ID (CID) as a new column\n",
    "    df['CID'] = row['CID']\n",
    "    for index, row in df.iterrows():\n",
    "        df.at[index, 'OrderID'] = int(get_random_number(10,100000))\n",
    "    df['OrderID'] = df['OrderID'].astype(int)\n",
    "\n",
    "    # Reorder columns to ensure 'CustomerID' is the first column\n",
    "    columns = ['CID', 'OrderID'] + [col for col in df.columns if col not in  ['CID', 'OrderID']]\n",
    "    df = df[columns]\n",
    "    # Append DataFrame to CSV file\n",
    "    df.to_csv(ORDERHISTORY_FILENAME, mode='a', index=False, header=not pd.io.common.file_exists(ORDERHISTORY_FILENAME))\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6ecad68e-a75d-4de1-b85e-3f21c90eaf11",
   "metadata": {},
   "source": [
    "#### Copy over the csv files to the \"data\" folder\n",
    "\n",
    "If you are satisfied with the csv files generated, copy them over to the data folder for ingestion."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
